arXiv:1506.02194v1 [stat.ML] 6 Jun 2015


Fast Mixing for Discrete Point Processes* 


Patrick Rebeschini patrick.rebeschini@yale.edu 

Amin Karbasi amin.karbasi@yale.edu 

Yale Institute for Network Science, 17 Hillhouse Avenue, Yale University, New Haven, CT, 06511, USA 


Abstract 

We investigate the systematic mechanism for designing fast mixing Markov chain Monte Carlo algorithms to sample from discrete point processes under the Dobrushin uniqueness condition for Gibbs measures. Discrete point processes are defined as probability distributions $\mu(S) \propto \exp(\beta f(S))$ over all subsets $S \in 2^V$ of a finite set $V$ through a bounded set function $f : 2^V \to \mathbb{R}$ and a parameter $\beta > 0$. A subclass of discrete point processes characterized by submodular functions (which include log-submodular distributions, submodular point processes, and determinantal point processes) has recently gained a lot of interest in machine learning and has been shown to be effective for modeling diversity and coverage. We show that if the set function (not necessarily submodular) displays a natural notion of decay of correlation, then, for $\beta$ small enough, it is possible to design fast mixing Markov chain Monte Carlo methods that yield error bounds on marginal approximations that do not depend on the size of the set $V$. The sufficient conditions that we derive involve a control on the (discrete) Hessian of set functions, a quantity that has not been previously considered in the literature. We specialize our results for submodular functions, and we discuss canonical examples where the Hessian can be easily controlled.

Keywords: Discrete point processes, MCMC, fast mixing, submodular functions, decay of correlation, Hessian of set functions


1. Introduction 

Probabilistic modeling and inference techniques have become essential tools for analyzing data and making predictions in a variety of real-world settings. Graphical models (Wainwright and Jordan, 2008) have provided an appealing framework to express dependencies among variables through a graph structure. A broad class of such models that have been widely used in machine learning is represented by Markov random fields, where the probability distribution of a collection of $n$ random variables $X := (X^1, \ldots, X^n)$ is defined as a product of non-negative potentials over maximal cliques $C \subseteq G$ of an undirected graph $G = (V, E)$, i.e., $\mu(x) := \mathbf{P}(X = x) = \frac{1}{Z} \prod_{C} \phi_C(x^C)$, where we adopted the notation $x^C := \{x^i\}_{i \in C}$. Here, $Z$ is the normalization factor, often called the partition function, and it is known to be hard to compute exactly (Jerrum and Sinclair, 1993), or even to approximate (Goldberg and Jerrum, 2007). Perhaps the most prominent example of Markov networks, with many applications in machine learning, is the pairwise Markov random field, also called Ising model, where cliques are defined on edges between pairs of variables.

These examples can be seen as instances of a general class of probabilistic models that we refer 
to as discrete point processes, which are defined as 

$$\mu(S) = \frac{1}{Z} \exp(\beta f(S)) \quad \text{for } S \subseteq V, \tag{1}$$

* This is the full version of a paper in the 28th Annual Conference on Learning Theory (COLT), 2015. 
where $f : 2^V \to \mathbb{R}$ is a bounded set function, $Z := \sum_{S \subseteq V} \exp(\beta f(S))$ is the partition function, and $\beta$ is a strictly positive real constant that parametrizes the distribution. Discrete point processes have been widely studied in mathematics and statistical physics (Daley and Vere-Jones, 2007, ch. 5), where they have been traditionally used to model particle processes and neural spiking activity, among many others. Note that distributions over subsets of $V$ are isomorphic to distributions of $n := |V|$ binary random variables $X^1, \ldots, X^n \in \{0,1\}$.
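To make definition (1) concrete, the following is a minimal sketch that enumerates all $2^n$ subsets of a small ground set and normalizes $\exp(\beta f(S))$ by brute force. The function names and the toy modular function are illustrative assumptions, not part of the paper; the enumeration is only feasible for very small $n$, which is exactly why sampling methods are needed.

```python
import itertools
import math


def all_subsets(ground_set):
    """Yield every subset of ground_set as a frozenset."""
    items = list(ground_set)
    for size in range(len(items) + 1):
        for combo in itertools.combinations(items, size):
            yield frozenset(combo)


def point_process(f, ground_set, beta):
    """Return {S: mu(S)} with mu(S) = exp(beta * f(S)) / Z, by enumeration."""
    unnormalized = {S: math.exp(beta * f(S)) for S in all_subsets(ground_set)}
    Z = sum(unnormalized.values())  # partition function: a sum over 2^n terms
    return {S: w / Z for S, w in unnormalized.items()}


def inclusion_marginal(mu, i):
    """Probability that element i belongs to the random set."""
    return sum(p for S, p in mu.items() if i in S)


# Hypothetical toy model: a modular f, so the elements are independent.
toy_weights = {0: 0.5, 1: -1.0, 2: 2.0}
mu = point_process(lambda S: sum(toy_weights[i] for i in S), toy_weights, beta=1.0)
print(inclusion_marginal(mu, 0))
```

For a modular function the inclusion probability of element $i$ is the logistic function of $\beta w_i$, which gives a simple sanity check on the enumeration.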

A subclass of discrete point processes, referred to as log-submodular (log-supermodular) distributions, has recently been investigated in Djolonga and Krause (2014), where the set function $f$ is taken to be submodular (respectively, supermodular), i.e., characterized by the property that the difference in the value of the function when an element is added to a set, the so-called marginal gain, decreases (increases) as the cardinality of the set increases. Throughout this paper, the discrete derivative, or marginal gain, of $f$ is defined as $\Delta_i f(S) := f(S \cup \{i\}) - f(S)$. The function $f$ is submodular (supermodular) if for any $S \subseteq S' \subseteq V$ and any $i \in V \setminus S'$, it holds $\Delta_i f(S) \geq \Delta_i f(S')$ ($\Delta_i f(S) \leq \Delta_i f(S')$). Under some regularity conditions (discussed in Djolonga and Krause (2014)), pairwise Markov random fields are also a special case of log-submodular point processes, as are determinantal point processes (Kulesza and Taskar, 2012), (Hough et al., 2006). Here, the submodular function is defined as $f(S) = \log\det(L_S)$, where $L \in \mathbb{R}^{n \times n}$ is a positive definite matrix and $L_S$ is the square submatrix of $L$ that is indexed by $S$. Determinantal point processes have been extensively used in physics and machine learning to model negative correlations, giving rise, for instance, to diverse sets of items in recommendations. Another related subclass of discrete point processes, called submodular (supermodular) point processes, has been recently proposed in Iyer and Bilmes (2015), where $\mu(S) \propto f(S)$ with $f$ a non-negative submodular (supermodular) function.
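The marginal gain and the submodularity property can be checked by brute force on small ground sets. The sketch below uses the pairwise characterization (the gain of $i$ cannot increase when a single further element $j$ is added), which is equivalent to the nested-set definition above. The helper names and the coverage example are illustrative assumptions.

```python
import itertools


def marginal_gain(f, i, S):
    """Discrete derivative: f(S + i) - f(S), for a frozenset S not containing i."""
    return f(S | {i}) - f(S)


def is_submodular(f, ground_set, tol=1e-12):
    """Check that the gain of i never increases when one more element j is added."""
    V = frozenset(ground_set)
    for size in range(len(V) + 1):
        for combo in itertools.combinations(sorted(V), size):
            S = frozenset(combo)
            rest = V - S
            for i in rest:
                for j in rest - {i}:
                    if marginal_gain(f, i, S | {j}) > marginal_gain(f, i, S) + tol:
                        return False
    return True


# Hypothetical coverage function: each element covers a set of points.
areas = {0: {1, 2}, 1: {2, 3}, 2: {4}}


def cover(S):
    """Number of points covered by the elements of S."""
    return len(set().union(*(areas[i] for i in S)))


print(is_submodular(cover, areas))                          # coverage is submodular
print(is_submodular(lambda S: len(S) ** 2, range(3)))       # |S|^2 is not
```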

The diminishing return property that characterizes submodularity, which makes submodular functions suitable for applications in several fields, ranging from economics to machine learning, has been extensively investigated in the domain of optimization (see Krause and Golovin (2014) for instance), but its role has yet to be established in the realm of probabilistic inference. Iyer and Bilmes (2015) showed that, in general, computing the partition function in log-submodular distributions and submodular point processes requires exponential complexity. Djolonga and Krause (2014) and Iyer and Bilmes (2015) resort to variational approaches to approximate the partition function. In Djolonga and Krause (2014) the authors provide upper and lower bounds based on sub- and super-gradients (Iyer et al., 2013), showing that the log partition function $\log(Z)$ can be approximated within $O(n)$. In their setting, however, this implies that the error bounds that can be derived from their theory to approximate marginals of the type $\mu(\{S \subseteq V : S \ni i\})$, for a given $i \in V$, deteriorate exponentially with the size of the model $n = |V|$, as can be deduced from their experimental results. In addition, the bounds that they consider depend on the curvature $c(f)$ of the submodular function $f$, a quantity between 0 and 1 that characterizes the deviation from modularity ($f$ is said to be modular if $\Delta_i f(S) = \Delta_i f(S')$ for all $S, S' \subseteq V$, $i \notin S, S'$, and in this case $c(f) = 0$). However, there are trivial examples with very little interaction between random variables (Section 4.1) for which the inference problem has a straightforward solution but $c(f) = 1$, so that the corresponding (lower/upper) bounds on the partition function in Djolonga and Krause (2014) are unbounded and no useful inference can be deduced. In contrast to Markov random fields, whose partition function is typically intractable and hard to approximate, determinantal point processes admit an exact algorithm for marginalization, albeit in time cubic in the size $n$. In order to avoid this cost, recently Kang (2013) considered a Markov chain Monte Carlo (MCMC) algorithm to sample from determinantal point processes. The author claims that this algorithm is fast mixing (i.e., the Markov chain gets arbitrarily close to equilibrium after a small, $O(n \log n)$, number of steps) with no additional assumptions on the model. However, the proofs of the results in that paper are wrong as they rely on ill-defined couplings between Markov chains.

In this paper we address the inference problem of approximating marginals of the type $\mu(\{S \subseteq V : S \ni i\})$ for discrete point processes defined in Eq. (1) through MCMC methods. In Section 2 we define the general class of local-update algorithms that we consider, and we present the main theoretical result for fast mixing, i.e., Theorem 1, which relies on the Dobrushin uniqueness condition for Gibbs measures (Dobrusin, 1970), (Georgii, 2011). In Section 3 we analyze two specific algorithms within this class, Gibbs sampling and Metropolis-Hastings, and we show that if the set function $f$ satisfies some natural notion of decay of correlation, then these algorithms are fast mixing and yield size-free error bounds on marginal approximations. The decay of correlation property that we exploit concerns the decay of the absolute value of the difference of marginal gains evaluated at sets differing only by a single element $j$, as a function of the element $i$ being added (in fact, here the role of $i$ and $j$ is symmetric). More precisely, this property is related to the second order derivatives $\Delta_j \Delta_i f(S) = \Delta_i f(S \cup \{j\}) - \Delta_i f(S) = \Delta_i \Delta_j f(S)$, i.e., to the (discrete) Hessian of the function $f$. If we define $M_{ij} := \max_{S \subseteq V : S \not\ni i,j} |\Delta_j \Delta_i f(S)|$ for each $i \neq j$, and $M_{ii} := 0$ for each $i$, we show that the Gibbs sampler is fast mixing if the following condition holds

$$\alpha(\beta)\, \beta\, \|M\|_\infty \leq \gamma < 1,$$

where $\|M\|_\infty := \max_{i \in V} \sum_{j \in V} M_{ij}$, $\alpha(\beta) := \max_{i \in V} \max_{S \subseteq V \setminus \{i\}} e^{-\beta \Delta_i f(S)}$, and $\gamma > 0$ is a quantity that does not depend on $n$. If the set function is submodular, i.e., the distribution is log-submodular, then we can simplify this condition (Lemma 8). To the best of our knowledge, these results are the first to emphasize the importance of the Hessian of set functions, a quantity that has not been previously investigated even in the optimization domain. Finally, in Section 4 we specialize our results for a number of canonical examples of submodular functions (facility location, cut function, log determinant functions leading to determinantal point processes, and decomposable functions). These examples attest that our general criterion, which a priori involves a combinatorial optimization to compute each term $M_{ij}$, can often be reduced to a simple-to-check condition. Proofs are given in Appendix A (theory) and Appendix B (applications). As a final remark, we should highlight that submodularity (supermodularity), which is equivalent to $\Delta_j \Delta_i f(S) \leq 0$ ($\Delta_j \Delta_i f(S) \geq 0$) for any $i, j \in V$, $i \neq j$, $S \subseteq V$, $S \not\ni i, j$, is not sufficient to guarantee fast mixing, as displayed by the different convergence behaviors of the Glauber dynamics for Ising models with respect to different values of the inverse temperature $\beta$. See Mossel and Sly (2013) and references therein.
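The quantities entering the sufficient condition can be computed by exhaustive search when $n$ is small. The sketch below evaluates $M$, $\|M\|_\infty$ and $\alpha(\beta)$ as they are described in the text, and applies them to the cut function of a four-vertex path; the helper names and the example graph are assumptions made for illustration only.

```python
import itertools
import math


def subsets(items):
    """Yield all subsets of items as frozensets."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in itertools.combinations(items, size):
            yield frozenset(combo)


def gain(f, i, S):
    """Marginal gain of adding i to S."""
    return f(S | {i}) - f(S)


def hessian_bound(f, V):
    """M[i][j] = max over S avoiding i, j of |gain(i, S + j) - gain(i, S)|."""
    V = list(V)
    M = {i: {j: 0.0 for j in V} for i in V}
    for i in V:
        for j in V:
            if i != j:
                others = [k for k in V if k not in (i, j)]
                M[i][j] = max(abs(gain(f, i, S | {j}) - gain(f, i, S))
                              for S in subsets(others))
    return M


def inf_norm(M):
    """Maximum row sum of a matrix with non-negative entries."""
    return max(sum(row.values()) for row in M.values())


def alpha(f, V, beta):
    """Largest value of exp(-beta * gain(i, S)) over i and S not containing i."""
    V = list(V)
    return max(math.exp(-beta * gain(f, i, S))
               for i in V for S in subsets([k for k in V if k != i]))


def path_cut(S, n=4):
    """Cut function of the path 0-1-2-3 with unit edge weights (toy example)."""
    return sum(1 for u in range(n - 1) if (u in S) != ((u + 1) in S))


beta = 0.1
M = hessian_bound(path_cut, range(4))
value = alpha(path_cut, range(4), beta) * beta * inf_norm(M)
print(value, value < 1)  # the condition holds when value <= gamma < 1
```

The exhaustive search costs on the order of $n^2 2^n$ function evaluations, so it only serves to illustrate the definitions; Section 4 discusses cases where the maximization has a closed form.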

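Notation. In this paper we adopt the usual vector/matrix notation for distributions, kernels, and functions defined on finite sets. Given two finite sets $\mathbb{X}$ and $\mathbb{Y}$ with respective cardinality $|\mathbb{X}|$ and $|\mathbb{Y}|$, we interpret a probability distribution $\rho$ on $\mathbb{X}$ as a $|\mathbb{X}|$-dimensional row vector, a kernel $T$ from $\mathbb{X}$ to $\mathbb{Y}$ as a matrix in $[0,1]^{\mathbb{X} \times \mathbb{Y}}$, and a function $h : \mathbb{Y} \to \mathbb{R}$ as a $|\mathbb{Y}|$-dimensional column vector. Hence, we write $\rho T h := \sum_{x \in \mathbb{X}, y \in \mathbb{Y}} \rho(x) T(x, y) h(y)$. For each $x \in \mathbb{X}$, we write $T_x$ to indicate the probability distribution $T_x : y \in \mathbb{Y} \mapsto T_x(y) := T(x, y)$, and for $m \geq 0$ we write $T^m$ to indicate the $m$-th power of the matrix $T$. Given $A \subseteq \mathbb{Y}$, we define the indicator function $\mathbf{1}_A$ as $\mathbf{1}_A(y) := 1$ if $y \in A$, $\mathbf{1}_A(y) := 0$ if $y \notin A$. Clearly, $\rho(A) = \rho \mathbf{1}_A$. For $y \in \mathbb{Y}$, we will also use the notation $\mathbf{1}_y$ to mean $\mathbf{1}_{\{y\}}$. If $V$ is a finite set, given $x = (x^i)_{i \in V}$, we write $x^S := (x^i)_{i \in S}$ for $S \subseteq V$. If $x, y \in \mathbb{R}$, we use $x \wedge y := \min\{x, y\}$ and $x \vee y := \max\{x, y\}$.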

2. Fast mixing MCMC algorithms for discrete point processes

Throughout this paper, let $V$ be a finite set with cardinality $n := |V|$. Let $\mathbb{S}_0 := \{0,1\}$ and define $\mathbb{S} := \prod_{i \in V} \mathbb{S}_0 = \{0,1\}^n$. Given $\beta > 0$, we consider the following probability distribution $\mu$ on $\mathbb{S}$

$$\mu(x) := \frac{e^{\beta f(x)}}{\sum_{z \in \mathbb{S}} e^{\beta f(z)}} \tag{2}$$

for $x \in \mathbb{S}$, where $f : \mathbb{S} \to \mathbb{R}$ is a given bounded function. In the following we will also consider the isomorphic description of the function $f$ given by the set function $f^* : 2^V \to \mathbb{R}$ defined as follows, for each $S \subseteq V$: $f^*(S) := f(x(S))$, where $x(S) \in \mathbb{S}$ is defined as $x(S)^i = 1$ if $i \in S$, $x(S)^i = 0$ if $i \notin S$. With an overload of notation, henceforth we refer to $f^*$ as $f$, leaving to the context the determination of what is meant. In this paper we address the problem of probabilistic inference for discrete point processes defined in (2), that is, the problem of computing guaranteed approximations to marginal probabilities of the type

$$\mu(\{x \in \mathbb{S} : x^i = 1 \ \forall i \in S\}) = \sum_{x \in \mathbb{S} : x^i = 1 \ \forall i \in S} \mu(x), \tag{3}$$

for a given $S \subseteq V$. Note that, in general, computing (3) exactly is hard, as computing the normalization function in (2) is #P-complete, see Jerrum and Sinclair (1993). Hence, we need to resort to approximation schemes. Our goal is to investigate the properties of the function $f$ that make it possible to design time-uniform Markov chains that quickly converge to $\mu$. More specifically, we want to design a transition kernel $T$ on $\mathbb{S}$ such that $\mu T = \mu$, and so that, for any distribution $\rho$ on $\mathbb{S}$ and any function $h : \mathbb{S} \to \mathbb{R}$, $\rho T^m h$ converges exponentially fast to $\mu h$, as $m$ increases. As $|\mathbb{S}| = 2^n$, a general transition kernel $T$ on $\mathbb{S}$ is a matrix with at most $2^n(2^n - 1)$ degrees of freedom (recall that for each $x \in \mathbb{S}$ it has to hold $\sum_{z \in \mathbb{S}} T(x, z) = 1$). In particular, for each $x \in \mathbb{S}$, $T_x$ is a distribution with $2^n - 1$ degrees of freedom, which can be difficult to sample from if $n$ is large. To avoid this exponential burden with the cardinality of $\mathbb{S}$, we restrict our attention to Markov chains that are defined as combinations of local-update probability kernels $K^{(i)}$ from $\mathbb{S}$ to $\mathbb{S}_0$ that we assume we can easily sample from (in particular, this implies that each $K^{(i)}$ should not depend on the normalization function in (2)). Hence, for each $i \in V$ define the transition kernel $T^{(i)}$ on $\mathbb{S}$ as

$$T^{(i)}(x, z) := K^{(i)}(x, z^i)\, \mathbf{1}_{x^{V \setminus \{i\}}}(z^{V \setminus \{i\}}), \tag{4}$$

where, for each $i \in V$, $K^{(i)}$ is a probability kernel from $\mathbb{S}$ to $\mathbb{S}_0$ so that $T^{(i)}$ leaves $\mu$ invariant, namely, $\mu T^{(i)} = \mu$. Label the elements of $V$ as $V = \{i_1, \ldots, i_n\}$. We consider the two chains:


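$$T_s := T^{(i_1)} \cdots T^{(i_n)} \qquad \text{(systematic scan)} \tag{5}$$

$$T_r := \Big( \frac{1}{n} \sum_{i \in V} T^{(i)} \Big)^n \qquad \text{(random scan)} \tag{6}$$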


Markov chains described by these types of transition kernels are usually referred to as systematic scan (5) and random scan (6) Markov chain Monte Carlo (MCMC).$^{1}$ Respectively, they give rise to Algorithm 1 and Algorithm 2, when the chains are run for $m \geq 1$ sweeps with initial distribution $\rho$.

1. Typically in the literature (see Dyer et al. (2009), for instance) the random scan MCMC sampler is defined as $S := \frac{1}{n} \sum_{i \in V} T^{(i)}$ instead of $T_r = S^n$ as in (6). Our choice in the present context is motivated by the fact that we want to compare random and systematic scan. Note, in fact, that a single application of the kernel $T_s$ in (5) involves updating all coordinates $i_1, \ldots, i_n$, while a single application of $S$ involves updating (uniformly at random) only one coordinate $i \in V$. This is why the right scale to make the comparison involves $n$ iterations of $S$.

Algorithm 1 Systematic scan MCMC sampler
Sample $X \in \mathbb{S}$ from the distribution $\rho$;
for $k = 1, \ldots, m$ do
    for $j = 1, \ldots, n$ do
        Draw $Z^{i_j}$ from the distribution $K^{(i_j)}_X$;
        Set $X^{i_j} \leftarrow Z^{i_j}$;
    end
end
Output: $X = (X^i)_{i \in V}$ that is distributed according to $\rho T_s^m$.


Algorithm 2 Random scan MCMC sampler
Sample $X \in \mathbb{S}$ from the distribution $\rho$;
for $k = 1, \ldots, nm$ do
    Sample $i \in V$ uniformly;
    Draw $Z^i$ from the distribution $K^{(i)}_X$;
    Set $X^i \leftarrow Z^i$;
end
Output: $X = (X^i)_{i \in V}$ that is distributed according to $\rho T_r^m$.


The following theorem describes the convergence behavior of the MCMC algorithms $T_s$ and $T_r$ under the so-called Dobrushin uniqueness condition. Since the seminal work in Dobrusin (1970)
several authors have presented different approaches to establish convergence bounds of the type in 
Theorem 1. We refer to Dyer et al. (2009) and references therein for a review of results that address 
fast mixing within the Dobrushin uniqueness framework. Theorem 1 represents the building block 
for the theory that will be developed in the next sections. 

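Theorem 1 (Local-update MCMC algorithms for discrete point processes) For each $i, j \in V$, define the Dobrushin coefficients as:

$$C_{ij} := \max_{x, z \in \mathbb{S} : x^{V \setminus \{j\}} = z^{V \setminus \{j\}}} \left| K^{(i)}_x(\{0\}) - K^{(i)}_z(\{0\}) \right|. \tag{7}$$

Let $R \in \mathbb{R}^{V \times V}$ be such that $C \leq R$, element-wise. Assume that the Dobrushin uniqueness condition holds, namely,

$$\|R\|_\infty := \max_{i \in V} \sum_{j \in V} R_{ij} \leq \gamma < 1. \tag{8}$$

Then, for any distribution $\rho$ on $\mathbb{S}$, any natural number $m \geq 0$, and any function $h : \mathbb{S} \to \mathbb{R}$, we have

$$|\rho T_s^m h - \mu h| \leq \gamma^m \sum_{i \in V} \delta_i(h), \qquad |\rho T_r^m h - \mu h| \leq \lambda^m \sum_{i \in V} \delta_i(h), \tag{9}$$

with $\lambda := [\ldots] < 1$ and $\delta_i(h) := \max_{x \in \mathbb{S}} |h(x^{V \setminus \{i\}} 0^i) - h(x^{V \setminus \{i\}} 1^i)|$.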

Remark 2 (On fast mixing) If $\gamma$ in the Dobrushin uniqueness condition (8) does not depend on $n$, then the results in Theorem 1 imply that the Markov chains defined by the transition kernels $T_s$ and $T_r$ are fast mixing. Recall that for each $\varepsilon > 0$ the mixing time $\tau(\varepsilon)$ of a Markov chain with transition kernel $T$ and unique invariant distribution $\mu$ is defined as (Levin et al., 2008)

$$\tau(\varepsilon) := \min\Big\{ m \geq 0 : \max_{\rho} \|\rho T^m - \mu\|_{TV} \leq \varepsilon \Big\},$$

where $\|\rho - \rho'\|_{TV} := \max_{A \subseteq \mathbb{S}} |\rho(A) - \rho'(A)|$ is the total variation distance between distributions $\rho$ and $\rho'$ on $\mathbb{S}$. Clearly, from (9) it follows that $|\rho T_s^m(A) - \mu(A)| \leq n \gamma^m$ and $|\rho T_r^m(A) - \mu(A)| \leq n \lambda^m$, for each $A \subseteq \mathbb{S}$. Hence, we can easily derive the following upper bounds for the mixing time of the systematic scan and random scan Markov chains, respectively,

$$\tau_s(\varepsilon) \leq \frac{\log(n \varepsilon^{-1})}{1 - \gamma}, \qquad \tau_r(\varepsilon) \leq \frac{\log(n \varepsilon^{-1})}{1 - \lambda},$$

which show that the Markov chains are fast mixing, that is, their mixing time is upper bounded by a quantity that scales only logarithmically with the size $n$ of the set $V$.$^{2}$ In fact, taking the case of the systematic scan Markov chain, for instance, we have

$$\tau_s(\varepsilon) \leq \min\Big\{ m \geq 0 : m \geq \frac{\log(n \varepsilon^{-1})}{-\log \gamma} \Big\} \leq \frac{\log(n \varepsilon^{-1})}{1 - \gamma},$$

where in the last inequality we used that $\log x \leq x - 1$ for each $x > 0$.
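As a rough numerical illustration of Remark 2, the sketch below compares the smallest number of sweeps $m$ for which the bound $n \gamma^m$ drops below $\varepsilon$ with the closed-form upper bound $\log(n \varepsilon^{-1}) / (1 - \gamma)$. The function names and the numbers used are hypothetical.

```python
import math


def mixing_time_upper_bound(n, eps, contraction):
    """Closed-form bound log(n / eps) / (1 - contraction) from Remark 2."""
    if not 0.0 <= contraction < 1.0:
        raise ValueError("contraction must lie in [0, 1)")
    return math.log(n / eps) / (1.0 - contraction)


def smallest_sweeps(n, eps, gamma):
    """Smallest integer m >= 0 such that n * gamma**m <= eps."""
    m = 0
    while n * gamma ** m > eps:
        m += 1
    return m


# Hypothetical values: n = 1000 elements, accuracy 0.01, gamma = 0.5.
print(smallest_sweeps(1000, 0.01, 0.5))
print(mixing_time_upper_bound(1000, 0.01, 0.5))
```

Because the dependence on $n$ is only through $\log n$, multiplying the number of elements by ten adds a constant number of sweeps rather than multiplying the work per coordinate.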


Remark 3 (On probabilistic inference) If $\gamma$ in the Dobrushin uniqueness condition (8) does not depend on $n$, then Theorem 1 yields that the inference problem of approximating marginals of the type (3) can be efficiently addressed via MCMC methods. In fact, choosing $h = \mathbf{1}_A$, with $A = \{x \in \mathbb{S} : x^i = 1, i \in S\}$ for a certain $S \subseteq V$, Theorem 1 provides the following exponentially decreasing error bounds, for any $m \geq 0$ and any distribution $\rho$ on $\mathbb{S}$:

$$|\rho T_s^m(A) - \mu(A)| \leq |S| \gamma^m, \qquad |\rho T_r^m(A) - \mu(A)| \leq |S| \lambda^m,$$

which do not depend on $n$. As a consequence, if $X_1, \ldots, X_N$ are independent random variables distributed according to $\rho T_s^m$, generated as prescribed in Algorithm 1 (analogous results follow for random scan), then $\hat{\mu}_A := \frac{1}{N} \sum_{k=1}^N \mathbf{1}_A(X_k)$ is a biased estimator of $\mu(A)$ with the typical Monte Carlo mean square error bias/variance decomposition:

$$\mathbf{E}[(\hat{\mu}_A - \mu(A))^2] = \underbrace{(\mathbf{E}[\hat{\mu}_A] - \mu(A))^2}_{\text{bias}^2} + \underbrace{\mathbf{E}[(\hat{\mu}_A - \mathbf{E}[\hat{\mu}_A])^2]}_{\text{variance}} \leq (|S| \gamma^m)^2 + \frac{1}{4N}.$$

We stress once again that under the current assumptions this error bound does not depend on the set size $n$. This is in sharp contrast to the upper bounds for marginals produced by the theory developed for the variational methods in Djolonga and Krause (2014); these bounds, in fact, deteriorate exponentially with $n$, as can be deduced from the experimental results presented by the authors.

2. Typically, fast mixing is defined when the mixing time is $O(n \log n)$, not $O(\log n)$. The difference in our setting is due to the definitions of the chains $T_s$ and $T_r$, which involve a full sweep over $n$ variables.

Remark 4 (On block updates) Using the results in Rebeschini and van Handel (2014) it is possible to generalize Theorem 1 for MCMC algorithms with block updates, in the spirit of Weitz (2005), that is, when the transition kernels $T^{(i)}$, $i \in V$, in (4) are replaced by

$$T^{(S)}(x, z) := K^{(S)}(x, z^S)\, \mathbf{1}_{x^{V \setminus S}}(z^{V \setminus S})$$

for each block $S \subseteq V$ in a given family $\mathcal{S}$ satisfying $\bigcup_{S \in \mathcal{S}} S = V$ (for instance, $\mathcal{S} = \{S \subseteq V : |S| = k\}$, for some $k \geq 1$), where $K^{(S)}$ is a probability kernel from $\mathbb{S}$ to $\prod_{i \in S} \mathbb{S}_0 = \{0,1\}^{|S|}$. In this case, for instance, a single step of the block-update random scan Gibbs sampler would read $\frac{1}{|\mathcal{S}|} \sum_{S \in \mathcal{S}} T^{(S)}$, which generalizes $\frac{1}{n} \sum_{i \in V} T^{(i)}$. A block-update version of Theorem 1 would yield weaker sufficient conditions for fast mixing involving higher values of the parameter $\beta$, in the spirit of Dobrushin and Shlosman (1985). However, these conditions would be more convoluted and difficult to analyze, and they would involve a control on the "block" derivatives of the form $\Delta_J \Delta_i f(S)$, with $J \subseteq V$. As the focus of the present work lies on investigating the basic properties of the set function $f$ that make it possible to efficiently address the inference problem in (2) via MCMC methods, we limit our analysis to local sampling schemes with single site updates.


3. Condition for fast mixing and decay of correlations 

In this section we introduce two of the most popular MCMC algorithms that are used in the literature within the local-update framework described by (4): Gibbs sampler and Metropolis-Hastings (Asmussen and Glynn, 2007). While many variants of these algorithms can also be considered, we restrict our analysis to their most basic implementations, as our goal is to investigate the fundamental principles behind fast mixing for discrete point processes. For both of these algorithms we compute the Dobrushin coefficients (7), and we provide a comparison between them in Lemma 7 below. Then, we apply Theorem 1 to the analysis of the Gibbs sampler algorithm, and we present sufficient conditions for fast mixing, particularly in the case when $f$ is submodular, Lemma 8 below. Henceforth, for each $x \in \mathbb{S}$ let $S(x) \subseteq V$ be defined as follows: $i \in S(x)$ if $x^i = 1$, $i \notin S(x)$ if $x^i = 0$.


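Lemma 5 (Gibbs sampler) For each $i \in V$, in the transition kernel $T^{(i)}$ in (4) choose

$$K^{(i)}(x, z^i) = \mathbf{P}[X^i = z^i \mid X^{V \setminus \{i\}} = x^{V \setminus \{i\}}] = \frac{1}{1 + e^{\beta \Delta_i f(S(x^{V \setminus \{i\}} 0^i))}} \times \begin{cases} 1 & \text{if } z^i = 0, \\ e^{\beta \Delta_i f(S(x^{V \setminus \{i\}} 0^i))} & \text{if } z^i = 1, \end{cases} \tag{10}$$

where the random variable $X$ has distribution $\mu$. Then, for each $i \in V$ we have $\mu T^{(i)} = \mu$, and the Dobrushin coefficients (7) read, for each $i, j \in V$,

$$C_{ij} = \begin{cases} 0 & \text{if } i = j, \\ \displaystyle \max_{S \subseteq V : S \not\ni i,j} \frac{|e^{\beta \Delta_i f(S)} - e^{\beta \Delta_i f(S \cup \{j\})}|}{(1 + e^{\beta \Delta_i f(S)})(1 + e^{\beta \Delta_i f(S \cup \{j\})})} & \text{if } i \neq j. \end{cases}$$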

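Lemma 6 (Metropolis-Hastings) For each $i \in V$, in the transition kernel $T^{(i)}$ in (4) choose

$$K^{(i)}(x, z^i) = \begin{cases} \alpha^i(x) & \text{if } z^i \neq x^i, \\ 1 - \alpha^i(x) & \text{if } z^i = x^i, \end{cases} \qquad \alpha^i(x) := \begin{cases} 1 \wedge e^{-\beta \Delta_i f(S(x^{V \setminus \{i\}} 0^i))} & \text{if } x^i = 1, \\ 1 \wedge e^{\beta \Delta_i f(S(x^{V \setminus \{i\}} 0^i))} & \text{if } x^i = 0. \end{cases} \tag{11}$$

Then, for each $i \in V$ we have $\mu T^{(i)} = \mu$, and the Dobrushin coefficients (7), here written $\tilde{C}_{ij}$, read, for each $i, j \in V$,

$$\tilde{C}_{ij} = \begin{cases} \displaystyle \max_{S \subseteq V : S \not\ni i} e^{-\beta |\Delta_i f(S)|} & \text{if } i = j, \\ \displaystyle \max_{S \subseteq V : S \not\ni i,j} \Big( \big| 1 \wedge e^{\beta \Delta_i f(S)} - 1 \wedge e^{\beta \Delta_i f(S \cup \{j\})} \big| \vee \big| 1 \wedge e^{-\beta \Delta_i f(S)} - 1 \wedge e^{-\beta \Delta_i f(S \cup \{j\})} \big| \Big) & \text{if } i \neq j. \end{cases}$$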


The implementation of the Gibbs sampler (10) and Metropolis-Hastings (11) is given, respectively, in Algorithm 3 and Algorithm 4 in Appendix A (we only present the random scan versions from Algorithm 2; the systematic scan versions can be obtained analogously from Algorithm 1).
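A minimal sketch of a random scan Gibbs sampler follows, combining the loop structure of Algorithm 2 with the conditional probability in (10). It is not the authors' Algorithm 3, which is not reproduced in this section; the function names, the empty initial set, and the toy modular function are assumptions made for illustration.

```python
import math
import random


def gibbs_random_scan(f, V, beta, sweeps, rng, init=None):
    """Run sweeps * |V| single-site Gibbs updates and return the final set."""
    V = list(V)
    S = set(init) if init else set()  # assumed initial state: the empty set
    for _ in range(sweeps * len(V)):
        i = rng.choice(V)             # pick a site uniformly at random
        S.discard(i)                  # condition on all the other sites
        delta = f(frozenset(S | {i})) - f(frozenset(S))
        p_include = 1.0 / (1.0 + math.exp(-beta * delta))
        if rng.random() < p_include:
            S.add(i)
    return frozenset(S)


def estimate_marginal(f, V, beta, i, sweeps, num_samples, seed=0):
    """Monte Carlo estimate of the inclusion probability of element i."""
    rng = random.Random(seed)
    hits = sum(i in gibbs_random_scan(f, V, beta, sweeps, rng)
               for _ in range(num_samples))
    return hits / num_samples


# Hypothetical modular example, where the exact marginal is a logistic function.
toy_weights = {0: 1.0, 1: -1.0, 2: 0.0}
toy_f = lambda S: sum(toy_weights[k] for k in S)
print(estimate_marginal(toy_f, toy_weights, 1.0, 0, sweeps=5, num_samples=2000))
```

Each update only needs one marginal gain of $f$, never the partition function, which is the practical point of restricting to local-update kernels.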

The following Lemma compares the Dobrushin coefficients $C$ for the Gibbs sampler, Lemma 5, and $\tilde{C}$ for the Metropolis-Hastings Markov chain, Lemma 6.


Lemma 7 (Comparison of Dobrushin coefficients) For each $i, j \in V$, $i \neq j$, we have

$$0 = C_{ii} \leq \tilde{C}_{ii}, \qquad \tfrac{1}{4} \tilde{C}_{ij} \leq C_{ij} \leq \tilde{C}_{ij}.$$


From Lemma 7 it follows that $C \leq \tilde{C}$, element-wise. In this respect, it might seem a good idea to investigate conditions for fast mixing for the Metropolis-Hastings algorithm, i.e., to upper bound $\tilde{C}$, as these conditions will immediately yield fast mixing for the Gibbs sampler as well. However, this approach has a main drawback. In fact, while for each $i \in V$ it holds that $C_{ii} = 0$, we have that $\tilde{C}_{ii} > 0$, and for a large system (i.e., for a large value of $n = |V|$) we typically expect this quantity to be very close to 1, which would create a problem to establish the Dobrushin uniqueness condition (8). For instance, take the case where the function $f$ is monotone (i.e., $\Delta_i f(S) \geq 0$ for each $i \in V$, $S \subseteq V$) and submodular. Then,

$$\tilde{C}_{ii} = \max_{S \subseteq V : S \not\ni i} e^{-\beta |\Delta_i f(S)|} = \max_{S \subseteq V : S \not\ni i} e^{-\beta \Delta_i f(S)}.$$

Typically, unless $f$ is modular or "close" to modular (in the sense that its curvature is close to zero, see Section 4.1), if $n$ is large we expect the marginal gain of $f$ when an element $i \in V$ is added to $V \setminus \{i\}$ to be close to 0, that is, $\Delta_i f(V \setminus \{i\}) \approx 0$, which yields $\tilde{C}_{ii} \approx 1$ if $\beta \approx 1$. This issue is not present in the Gibbs sampler, as $C_{ii} = 0$ for each $i \in V$. For this reason, henceforth we focus on the Gibbs sampler and we provide sufficient conditions for it to be fast mixing.


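Lemma 8 (Fast mixing Gibbs samplers for discrete point processes) Assume that the following condition holds:

$$\alpha(\beta) \max_{i \in V} \sum_{j \in V \setminus \{i\}} \max_{S \subseteq V : S \not\ni i,j} \left| 1 - e^{\beta \Delta_j \Delta_i f(S)} \right| \leq \gamma < 1, \tag{12}$$

where $\alpha(\beta) := \max_{i \in V} \max_{S \subseteq V \setminus \{i\}} e^{-\beta \Delta_i f(S)}$. Let $T_s$ and $T_r$ be the Gibbs samplers defined as in Section 2, with $K^{(i)}$, $i \in V$, defined as in (10). Then, for any distribution $\rho$ on $\mathbb{S}$, any natural number $m \geq 0$, and any function $h : \mathbb{S} \to \mathbb{R}$, we have

$$|\rho T_s^m h - \mu h| \leq \gamma^m \sum_{i \in V} \delta_i(h), \qquad |\rho T_r^m h - \mu h| \leq \lambda^m \sum_{i \in V} \delta_i(h),$$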



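where $\lambda := [\ldots] < 1$ and $\delta_i(h) := \max_{x \in \mathbb{S}} |h(x^{V \setminus \{i\}} 0^i) - h(x^{V \setminus \{i\}} 1^i)|$. If $\gamma$ does not depend on $n$, then the chains are fast mixing. In particular, if $f$ is submodular$^{3}$, then condition (12) reads

$$\alpha(\beta) \max_{i \in V} \sum_{j \in V \setminus \{i\}} \max_{S \subseteq V : S \not\ni i,j} \left( 1 - e^{\beta \Delta_j \Delta_i f(S)} \right) \leq \gamma < 1, \tag{13}$$

with $\alpha(\beta) = \max_{i \in V} e^{-\beta \Delta_i f(V \setminus \{i\})}$. If $f$ is monotone, clearly $\alpha(\beta) \leq 1$.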


Note that using the inequality $1 - e^x \leq -x$ for each $x \in \mathbb{R}$, condition (12) can be replaced by the stronger (but simpler) condition $\alpha(\beta) \beta \|M\|_\infty \leq \gamma < 1$, where $M_{ij} := \max_{S \subseteq V : S \not\ni i,j} |\Delta_j \Delta_i f(S)|$ for each $i \neq j$, and $M_{ii} := 0$ for each $i$. This condition makes manifest the property of the function $f$ that renders the inference problem tractable via local-update MCMC methods. As highlighted in Remark 2 and also in Remark 3, in order for the Markov chains to be fast mixing we need $\gamma$ to be independent of the set size $n = |V|$. In particular, we need $\|M\|_\infty$ to be upper bounded by a constant that does not depend on $n$. This is achieved, for instance, if $(V, d)$ is a metric space and the function $f$ displays one of the following forms of decay of correlation with respect to the metric $d$.

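• (Exponential decay of correlations) Assume that for each $i, j \in V$, $i \neq j$, we have

$$M_{ij} := \max_{S \subseteq V : S \not\ni i,j} |\Delta_j \Delta_i f(S)| \leq a\, e^{-a' d(i,j)},$$

where $a, a' > 0$ are two constants so that $\|M\|_\infty \leq \max_{i \in V} \sum_{j \in V} a\, e^{-a' d(i,j)} \leq c$, where $c$ does not depend on $n$.

• (Finite-range correlations) Assume that there exists $r > 0$ such that, for each $i, j \in V$, $i \neq j$,

$$M_{ij} := \max_{S \subseteq V : S \not\ni i,j} |\Delta_j \Delta_i f(S)| \leq \begin{cases} c & \text{if } d(i,j) \leq r, \\ 0 & \text{if } d(i,j) > r, \end{cases}$$

where $c > 0$ does not depend on $n$. For each $i \in V$ define $N(i) := \{j \in V : d(i,j) \leq r\}$, and assume that $N := \max_{i \in V} |N(i)|$ does not depend on $n$. Then, clearly, $\|M\|_\infty \leq cN$.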

In the remainder of this paper we assume that the function $f$ is submodular, and we investigate condition (13). In this case, note that the quantity $\alpha(\beta)$ involves an optimization over only $n$ values, while for each $i, j \in V$, $i \neq j$, the term $\max_{S \subseteq V : S \not\ni i,j} (1 - e^{\beta \Delta_j \Delta_i f(S)})$ involves an optimization problem over $2^{n-2}$ possibilities. As we will see in the next section, however, it is often the case that we can compute this term exactly, or that we can easily produce upper bounds for it.


4. Applications to log submodular point processes 

In this section we apply Lemma 8 to a few canonical examples of submodular functions. First, in Section 4.1 we discuss the elementary behavior of the Gibbs sampler in trivial (i.e., i.i.d.) models defined by modular functions. This case will serve to point out another deficiency (on top of the one already highlighted in Remark 3) of the theoretical results presented in Djolonga and Krause (2014) with respect to the notion of curvature. Subsequently, we consider functions that are defined

3. As done in Djolonga and Krause (2014) and Iyer and Bilmes (2015), we can also specialize our results to supermodular functions ($f$ is supermodular if $-f$ is submodular). In this case we would get $\alpha(\beta) = \max_{i \in V} e^{-\beta \Delta_i f(\emptyset)}$.
on a finite graph $G = (V, E)$, where $n = |V|$ and the edge set $E$ describes the pairs of vertices that interact, Sections 4.2, 4.3, and 4.4. The strength of the pairwise interaction is modeled by a symmetric matrix $L$. The nature of the assumptions that our theory requires to yield fast mixing depends on the application at hand, and they involve the structure of the matrix $L$. Note that for the applications discussed in Sections 4.2 and 4.3 these assumptions would be satisfied if $L_{ij} \leq a\, e^{-a' d(i,j)}$ or if $L_{ij} = 0$ whenever $d(i,j) > r$, in the spirit of the decay of correlation properties discussed in the previous section. Finally, in Section 4.5 we consider the case of decomposable functions, that is, functions that can be represented as sums of concave functions applied to modular functions. Since we need to calculate discrete derivatives, we usually prove submodularity along the way as it is nothing but $\Delta_j \Delta_i f(S) \leq 0$ for each $i, j \in V$, $i \neq j$, $S \subseteq V$, $i, j \notin S$.


4.1. Modular functions 


Let $V$ be a finite set with $n = |V|$. For each $i \in V$, let $w_i \in \mathbb{R}$ be given. For each $S \subseteq V$, let $f(S) := \sum_{i \in S} w_i$ and $f(\emptyset) := 0$. It is immediate to verify that the function $f$ is modular as $\Delta_j \Delta_i f(S) = 0$ for each $i, j \in V$, $i \neq j$, $S \subseteq V$. Clearly, in this case condition (13) is satisfied with $\gamma = 0$, and Lemma 8 implies that for each $m \geq 1$ we have $|\rho T_s^m h - \mu h| = 0$, and $|\rho T_r^m h - \mu h| = 0$. In particular, these results hold even for $m = 1$, which means that the Gibbs samplers sample exactly from the distribution $\mu$ in a single sweep.

We now slightly tweak the trivial example just considered to compare the theoretical guarantees provided in Lemma 8 for the Gibbs sampler against the guarantees provided in Djolonga and Krause (2014) for the variational methods they proposed, as a function of the curvature $c(f)$ of $f$, which, for monotone submodular functions with $f(\emptyset) \geq 0$, is defined as:

$$c(f) := 1 - \min_{i \in V} \frac{\Delta_i f(V \setminus \{i\})}{f(\{i\})}.$$

Fix $k, k' \in V$, $k \neq k'$, and let $g$ be the function defined as $g(S) := 1$ if $S \cap \{k, k'\} \neq \emptyset$ and $g(S) := 0$ if $S \cap \{k, k'\} = \emptyset$. It is easy to check that $g$ is submodular with

$$\Delta_j \Delta_i g(S) = \begin{cases} -1 & \text{if } \{i, j\} = \{k, k'\}, \\ 0 & \text{otherwise}, \end{cases}$$

for each $i, j \in V$, $i \neq j$, $S \subseteq V$, $i, j \notin S$. For each $i \in V \setminus \{k, k'\}$ let $w_i = 1$, and let $w_k = w_{k'} = 0$. Consider the submodular function $f(S) := \sum_{i \in S} w_i + g(S) = |S \setminus \{k, k'\}| + g(S)$. Then we clearly have $c(f) = 1$, as the minimum that appears in the definition of curvature is 0 (it is attained at $i = k$, for instance), so that the bounds in Djolonga and Krause (2014) diverge to infinity. On the other hand, in this case the results in Lemma 8 hold with $\gamma = 1 - e^{-\beta}$. In fact, $\alpha(\beta) = \max_{i \in V} e^{-\beta \Delta_i f(V \setminus \{i\})} = 1$ and $\Delta_j \Delta_i f(S) = \Delta_j \Delta_i g(S)$, from which it follows that condition (13) reads

$$\alpha(\beta) \max_{i \in V} \sum_{j \in V \setminus \{i\}} \max_{S \subseteq V : S \not\ni i,j} \left( 1 - e^{\beta \Delta_j \Delta_i f(S)} \right) = 1 - e^{-\beta} =: \gamma < 1.$$

4.2. Facility location 

Let $V = \{1, \ldots, n\}$ be a collection of facilities, and $W = \{1, \ldots, m\}$ be a collection of customers. Let $L_{k\ell} \geq 0$ be the value provided to customer $k \in W$ by the facility $\ell \in V$. For each $S \subseteq V$, let $g(S) := \sum_{k \in W} \max_{\ell \in S} L_{k\ell}$ and $g(\emptyset) := 0$. By definition, $g(S)$ is the maximum value that the facilities in $S$ can provide to all customers, provided that each customer chooses to be served only by the facility that provides the highest value to him. We consider the function $f$ defined as

$$f(S) := g(S) - \lambda |S|, \tag{14}$$

where $|S|$ denotes the cardinality of the set $S$, and $\lambda \geq 0$. Such a function is submodular and has been considered in many applications such as large-scale clustering (Mirzasoleiman et al., 2013) and recommender systems (El-Arini et al., 2009).

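Corollary 9 For the function $f$ in (14), Lemma 8 applies if condition (13) is replaced with

$$\alpha(\beta) \max_{i \in V} \sum_{j \in V \setminus \{i\}} \left( 1 - e^{-\beta \sum_{k=1}^m L_{ki} \wedge L_{kj}} \right) \leq \gamma < 1.$$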
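The exponent in Corollary 9 suggests that, for the facility location function, the largest absolute second-order difference between facilities $i$ and $j$ equals $\sum_k L_{ki} \wedge L_{kj}$. The sketch below compares this closed form against exhaustive search on a small hypothetical value matrix; the function names and the matrix are assumptions, and the check is an illustration rather than a proof.

```python
import itertools


def facility_value(L, S, lam=0.0):
    """f(S) = sum over customers of the best value in S, minus lam * |S|."""
    if not S:
        return 0.0
    return sum(max(row[l] for l in S) for row in L) - lam * len(S)


def hessian_entry_bruteforce(L, i, j, lam=0.0):
    """Max over S avoiding i, j of |f(S+i+j) - f(S+i) - f(S+j) + f(S)|."""
    n = len(L[0])
    others = [l for l in range(n) if l not in (i, j)]
    best = 0.0
    for size in range(len(others) + 1):
        for combo in itertools.combinations(others, size):
            S = frozenset(combo)
            second = (facility_value(L, S | {i, j}, lam)
                      - facility_value(L, S | {i}, lam)
                      - facility_value(L, S | {j}, lam)
                      + facility_value(L, S, lam))
            best = max(best, abs(second))
    return best


def hessian_entry_closed_form(L, i, j):
    """Sum over customers of min(L[k][i], L[k][j])."""
    return sum(min(row[i], row[j]) for row in L)


# Hypothetical values: rows are customers, columns are facilities.
L = [[3, 1, 2],
     [0, 4, 1],
     [2, 2, 5]]
print(hessian_entry_bruteforce(L, 0, 1), hessian_entry_closed_form(L, 0, 1))
```

With the closed form, each entry costs $O(m)$ to evaluate instead of a search over $2^{n-2}$ subsets, which is what makes the condition simple to check in practice.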


4.3. Generalized graph cut 

Let $V$ be a vertex set, and for each $i, j \in V$, $i \neq j$, let $L_{ij} = L_{ji} \geq 0$ be the weight associated to the undirected edge between $i$ and $j$. Let $L_{ii} = 0$ for each $i \in V$. For each $S \subseteq V$, let

$$f(S) := a + b \sum_{k \in S} \sum_{\ell \in V} L_{k\ell} - c \sum_{k \in S} \sum_{\ell \in S} L_{k\ell} \tag{15}$$

and $f(\emptyset) = a$, with $a, b, c \geq 0$ (typically $a$ is chosen so that $f(S) \geq 0$ for any $S \subseteq V$). In the case $a = 0$, $b = c = 1$, we recover $f(S) = f(V \setminus S) = \sum_{k \in S} \sum_{\ell \in V \setminus S} L_{k\ell}$ and $f(\emptyset) = f(V) = 0$, which is the standard graph cut function. Namely, $f(S)$ is the sum of the weights of each edge that connects a point in $S$ with a point in $V \setminus S$. The generalized cut function $f$ defined above is submodular, with many applications in computer vision (Jegelka et al., 2011).


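Corollary 10 For the function $f$ in (15), Lemma 8 applies if condition (13) is replaced with

$$e^{\beta (2c - b) \min_{i \in V} \sum_{\ell \in V \setminus \{i\}} L_{i\ell}} \max_{i \in V} \sum_{j \in V \setminus \{i\}} \left(1 - e^{-2c\beta L_{ij}}\right) \le \gamma < 1.$$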
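For the generalized cut function the discrete Hessian is constant in $S$, which can be checked by enumeration on a small graph. The sketch below is an illustration under stated assumptions: the weight matrix `L`, the parameters `a`, `b`, `c`, `beta`, and all function names are hypothetical. The toy parameters are chosen with $b \ge 2c$, the regime in which $(b - 2c) \min_{i} \sum_{\ell \neq i} L_{i\ell}$ is indeed the smallest marginal gain $\Delta_i f(V \setminus \{i\})$.

```python
import math
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def cut_f(S, L, a, b, c):
    """Generalized cut function (15); returns a for the empty set."""
    n = len(L)
    to_all = sum(L[k][l] for k in S for l in range(n))
    inside = sum(L[k][l] for k in S for l in S)
    return a + b * to_all - c * inside


def gain(f, S, i):
    """Marginal gain Delta_i f(S)."""
    return f(S | {i}) - f(S)


def hessian(f, S, i, j):
    """Discrete Hessian Delta_j Delta_i f(S)."""
    return gain(f, S | {j}, i) - gain(f, S, i)


def condition13_lhs(f, n, beta):
    """Brute-force left-hand side of condition (13); exponential in n."""
    everything = frozenset(range(n))
    alpha = max(math.exp(-beta * gain(f, everything - {i}, i)) for i in range(n))
    worst = 0.0
    for i in range(n):
        total = 0.0
        for j in range(n):
            if j != i:
                rest = [v for v in range(n) if v not in (i, j)]
                total += max(1.0 - math.exp(beta * hessian(f, S, i, j))
                             for S in subsets(rest))
        worst = max(worst, total)
    return alpha * worst


def corollary10_lhs(L, b, c, beta):
    """Closed-form left-hand side of the condition in Corollary 10."""
    n = len(L)
    degrees = [sum(L[i][l] for l in range(n) if l != i) for i in range(n)]
    prefactor = math.exp(beta * (2 * c - b) * min(degrees))
    worst = max(sum(1.0 - math.exp(-2 * c * beta * L[i][j])
                    for j in range(n) if j != i) for i in range(n))
    return prefactor * worst


# Hypothetical symmetric weights with zero diagonal.
L = [[0.0, 1.0, 0.5, 0.0],
     [1.0, 0.0, 0.2, 0.3],
     [0.5, 0.2, 0.0, 1.0],
     [0.0, 0.3, 1.0, 0.0]]
a, b, c, beta = 0.0, 3.0, 1.0, 0.1
f = lambda S: cut_f(S, L, a, b, c)
n = len(L)

# The Hessian should equal -2 c L[i][j] for every admissible S.
hessian_gap = max(abs(hessian(f, S, i, j) + 2 * c * L[i][j])
                  for i in range(n) for j in range(n) if i != j
                  for S in subsets(v for v in range(n) if v not in (i, j)))
brute = condition13_lhs(f, n, beta)
closed = corollary10_lhs(L, b, c, beta)
print(f"Hessian gap = {hessian_gap:.2e}, brute = {brute:.6f}, closed = {closed:.6f}")
```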


4.4. Determinantal point processes 

Fix $n$ and let $L \in \mathbb{R}^{n \times n}$ be a positive definite matrix. Let $V = \{1, \ldots, n\}$, and for each $S \subseteq V$ let

$$f(S) := \log \det L_S \tag{16}$$

and $f(\emptyset) := 0$, where $L_S := (L_{ij})_{i,j \in S}$. Such a function is submodular and has been used in
determinantal point processes, for which the partition function can be computed exactly in time $O(n^3)$.
Recently, several authors have considered MCMC algorithms to sample from determinantal point
processes, see Shah and Ghahramani (2013) and Kang (2013) for instance. However, to the best
of our knowledge, no theoretical guarantees have ever been established for the MCMC$^4$ algorithms
being adopted. As mentioned in the introduction, the proofs in Kang (2013) are wrong (although
the sampling scheme is correct, exactly matching Algorithm 4 in the case $\beta = 1$).

4. There are efficient sampling schemes that are not based on the MCMC paradigm. See Deshpande and Rademacher
(2010) and discussion therein, where they explicitly pose the problem of designing MCMC samplers.

In this section we give a probabilistic interpretation of the mechanism behind the structure of the
matrix $L$ that guarantees fast mixing MCMC algorithms in the light of the theory being developed.
Let $L$ be the covariance matrix of a collection of Gaussian random variables $X^1, \ldots, X^n$. It can be
shown that, for $i, j \in V$, $i \neq j$, $S \subseteq V$, $i, j \notin S$, we have $\mathrm{Cov}(X^i, X^j | X^S) = L_{ij} - L_{i,S} L_S^{-1} L_{S,j}$.

Define the conditional correlation coefficients as follows, for $i, j \in V$, $S \subseteq V$:

$$\rho(i,j|S) := \frac{\mathrm{Cov}(X^i, X^j | X^S)}{\sqrt{\mathrm{Var}(X^i | X^S)} \sqrt{\mathrm{Var}(X^j | X^S)}},$$

and let $I(X^i; X^j | X^S)$ be the conditional mutual information of $X^i$ and $X^j$ given $X^S$. It holds:

$$I(X^i; X^j | X^S) = -\tfrac{1}{2} \log\left(1 - \rho(i,j|S)^2\right). \tag{17}$$

We now show how condition (13) in Lemma 8 can be stated in terms of the quantities just introduced. 


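Corollary 11 For the function $f$ in (16), Lemma 8 applies with the condition (13) that reads

$$\max_{i \in V} \frac{1}{\mathrm{Var}(X^i | X^{V \setminus \{i\}})^{\beta}} \max_{i \in V} \sum_{j \in V \setminus \{i\}} \max_{S \subseteq V : S \not\ni i,j} \left(1 - e^{-2\beta I(X^i; X^j | X^S)}\right) \le \gamma < 1.$$

If $\beta = 1$, the condition is

$$\max_{i \in V} \frac{1}{\mathrm{Var}(X^i | X^{V \setminus \{i\}})} \max_{i \in V} \sum_{j \in V \setminus \{i\}} \max_{S \subseteq V : S \not\ni i,j} \rho(i,j|S)^2 \le \gamma < 1.$$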


Practitioners who are interested in fast mixing Gibbs samplers for determinantal point processes
need to exploit additional structure in the matrix $L$ to simplify the conditions in Corollary 11.
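As a small worked illustration (not taken from the original text), consider the hypothetical case $n = 2$ with unit variances and correlation $r$, $|r| < 1$, and $\beta = 1$. The only admissible conditioning set for the pair $(1,2)$ is $S = \emptyset$, so the condition of Corollary 11 reduces to a single inequality in $r$:

```latex
\[
L = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}, \qquad
\mathrm{Var}(X^1 \mid X^2) = \mathrm{Var}(X^2 \mid X^1) = 1 - r^2, \qquad
\rho(1,2 \mid \emptyset) = r,
\]
\[
\max_{i \in V} \frac{1}{\mathrm{Var}(X^i \mid X^{V \setminus \{i\}})}
\max_{i \in V} \sum_{j \in V \setminus \{i\}} \max_{S} \rho(i,j \mid S)^2
= \frac{r^2}{1 - r^2} \le \gamma < 1
\quad\Longleftrightarrow\quad r^2 < \tfrac{1}{2}.
\]
```

In this toy case the condition can therefore be met for some $\gamma < 1$ exactly when $|r| < 1/\sqrt{2}$; the same value follows directly from (16), since $\Delta_2 \Delta_1 f(\emptyset) = \log(1 - r^2)$ and $\Delta_1 f(\{2\}) = \log(1 - r^2)$.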


4.5. Decomposable functions 

Given a finite set $V$, let $\mathcal{S}$ be a collection of subsets of $V$ that covers $V$, i.e., $\mathcal{S} \subseteq 2^V$ with
$\bigcup_{A \in \mathcal{S}} A = V$. For each $A \in \mathcal{S}$ let $\phi_A : \mathbb{R}_+ \to \mathbb{R}$ be a concave function. For each $S \subseteq V$, let

$$f(S) := \sum_{A \in \mathcal{S}} \phi_A(|A \cap S|). \tag{18}$$

The function $f$ is submodular (Stobbe and Krause, 2010). For instance, note that if $\mathcal{S} = \{S \subseteq V :
|S| = 2\}$ and for each $A \in \mathcal{S}$ we have $\phi_A(0) = \phi_A(2) = -J/n$, $\phi_A(1) = J/n$, for a given constant
$J > 0$, then (2) corresponds to the distribution of an antiferromagnetic mean-field Ising model with
inverse temperature $\beta$ and zero external magnetic field.
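For the mean-field Ising example, only the pair $A = \{i,j\}$ contains both $i$ and $j$, so the discrete Hessian should equal $\phi_A(2) - 2\phi_A(1) + \phi_A(0) = -4J/n$ for every admissible $S$. The sketch below checks this by enumeration; the values of `n` and `J`, the 0-based labelling of $V$, and the function names are hypothetical choices for illustration.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def ising_f(S, n, J):
    """Decomposable function (18) with one term per unordered pair {u, v}."""
    phi = {0: -J / n, 1: J / n, 2: -J / n}
    return sum(phi[len(S & {u, v})] for u, v in combinations(range(n), 2))


def hessian(f, S, i, j):
    """Discrete Hessian Delta_j Delta_i f(S), for i and j outside S."""
    return f(S | {i, j}) - f(S | {i}) - f(S | {j}) + f(S)


n, J = 5, 1.0
f = lambda S: ising_f(S, n, J)

# Largest deviation of the Hessian from the predicted constant -4J/n.
worst_gap = 0.0
for i, j in combinations(range(n), 2):
    rest = [v for v in range(n) if v not in (i, j)]
    for S in subsets(rest):
        worst_gap = max(worst_gap, abs(hessian(f, S, i, j) + 4 * J / n))

print(f"largest deviation from -4J/n: {worst_gap:.2e}")
```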


Corollary 12 For each $A \in \mathcal{S}$ let the concave function $\phi_A$ be twice differentiable. Assume there
exist constants $c' \le 0 \le c$ such that $d\phi_A(x)/dx \ge c$ and $c' \le d^2\phi_A(x)/dx^2 \le 0$ for all $x \in [0, n]$
and $A \in \mathcal{S}$. Then, for the function $f$ in (18), Lemma 8 applies if condition (13) is replaced with

$$\left(1 - e^{\beta c'}\right) e^{-\beta c \min_{i \in V} |\{A \in \mathcal{S} : A \ni i\}|} \max_{i \in V} \Big| \bigcup_{A \in \mathcal{S} : A \ni i} A \Big| \le \gamma < 1.$$


Appendix A. Theory, proofs

Below are the proofs of the results presented in Section 2 and Section 3. 

Proof [Proof of Theorem 1] The proof is based on the Wasserstein matrix approach, which is a
standard tool in the analysis of high-dimensional Markov chains, cf. Föllmer (1979). For each two
distributions $\rho, \rho'$ on $\mathbb{S}_0$, and function $g : \mathbb{S}_0 \to \mathbb{R}$, we have

$$|\rho g - \rho' g| \le |g(0) - g(1)| \, |\rho(\{0\}) - \rho'(\{0\})|,$$

from which it follows that, for each $i, j \in V$,

$$\delta_j(K^{[i]} g) = \max_{x \in \mathbb{S}} \left| K^{[i]} g(x^{V \setminus \{j\}} 0^j) - K^{[i]} g(x^{V \setminus \{j\}} 1^j) \right| \le |g(0) - g(1)| \, C_{ij}.$$

For each function $h : \mathbb{S} \to \mathbb{R}$, $i \in V$, $x \in \mathbb{S}$, define the function $h^i_x : z^i \in \mathbb{S}_0 \mapsto h^i_x(z^i) := h(x^{V \setminus \{i\}} z^i)$.
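Then, we have

$$\begin{aligned}
\delta_j(T^{[i]} h)
&= \max_{x \in \mathbb{S}} \left| K^{[i]} h^i_{x^{V \setminus \{j\}} 0^j}(x^{V \setminus \{j\}} 0^j) - K^{[i]} h^i_{x^{V \setminus \{j\}} 1^j}(x^{V \setminus \{j\}} 1^j) \right| \\
&\le \max_{x \in \mathbb{S}} \left| K^{[i]} h^i_{x^{V \setminus \{j\}} 0^j}(x^{V \setminus \{j\}} 0^j) - K^{[i]} h^i_{x^{V \setminus \{j\}} 0^j}(x^{V \setminus \{j\}} 1^j) \right| \\
&\quad + \max_{x \in \mathbb{S}} \left| K^{[i]} h^i_{x^{V \setminus \{j\}} 0^j}(x^{V \setminus \{j\}} 1^j) - K^{[i]} h^i_{x^{V \setminus \{j\}} 1^j}(x^{V \setminus \{j\}} 1^j) \right| \\
&\le \max_{x \in \mathbb{S}} \delta_j(K^{[i]} h^i_x) + 1_{i \neq j} \max_{x \in \mathbb{S}} \max_{z \in \mathbb{S}_0} \left| h^i_{x^{V \setminus \{j\}} 0^j}(z) - h^i_{x^{V \setminus \{j\}} 1^j}(z) \right| \\
&\le \delta_i(h) \, C_{ij} + 1_{i \neq j} \, \delta_j(h),
\end{aligned}$$

where in the last line we applied the previous bound, and we use the notation $1_{i \neq j} := 1$ if $i \neq j$,
$1_{i \neq j} := 0$ otherwise. It is convenient to rewrite the previous estimate using a matrix notation:

$$\delta_j(T^{[i]} h) \le \sum_{k \in V} \delta_k(h) \, (W^{[i]})_{kj}, \tag{19}$$

where for each $i \in V$, the matrix $W^{[i]} \in \mathbb{R}^{V \times V}$ is defined as $W^{[i]}_{kj} := 1_{k=i} C_{kj} + 1_{k \neq i} 1_{k=j}$. The
matrix $W^{[i]}$ is a Wasserstein matrix for the transition kernel $T^{[i]}$.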

First, we apply estimate (19) to bound the systematic scan Markov chain. Iterating this estimate,
we immediately find $\delta_j(T_s h) \le \sum_{k \in V} \delta_k(h) (W_s)_{kj}$, where $W_s := W^{[n]} \cdots W^{[1]}$. Given any two
distributions $\rho$ and $\rho'$ on $\mathbb{S}$, by a telescoping argument we find, for any $m \ge 0$,

$$|\rho T_s^m h - \rho' T_s^m h| \le \max_{x,z \in \mathbb{S}} |T_s^m h(x) - T_s^m h(z)| \le \sum_{j \in V} \delta_j(T_s^m h) \le \sum_{j \in V} \sum_{i \in V} \delta_i(h) (W_s^m)_{ij} \le \|W_s\|_\infty^m \sum_{i \in V} \delta_i(h) \le \gamma^m \sum_{i \in V} \delta_i(h),$$

where for the last inequality we used that $\|W_s\|_\infty \le \|C\|_\infty \le \gamma < 1$ (this fact is proved in Corollary
24 in Dyer et al. (2009); note that the $\|\cdot\|_1$ matrix norm in the authors' notation corresponds to the
$\|\cdot\|_\infty$ matrix norm in our notation).
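The matrices $W^{[i]}$ and their product $W_s$ are easy to form explicitly, which gives a quick numerical check of the norm bound $\|W_s\|_\infty \le \|C\|_\infty$ on a given example. The sketch below is illustrative only: the matrix `C` is a hypothetical Dobrushin matrix with zero diagonal and row sums at most $0.6$, indices are 0-based, and the helper names are not from the original text.

```python
def wasserstein_matrix(C, i):
    """W^{[i]}: row i copies row i of C, every other row k is the k-th unit vector."""
    n = len(C)
    return [[C[k][j] if k == i else float(k == j) for j in range(n)]
            for k in range(n)]


def matmul(A, B):
    """Plain matrix product of two lists of lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))]
            for r in range(len(A))]


def inf_norm(A):
    """Maximum absolute row sum, i.e. the infinity matrix norm used in the text."""
    return max(sum(abs(x) for x in row) for row in A)


def systematic_scan_matrix(C):
    """W_s = W^{[n]} ... W^{[1]}, built by successive left multiplication."""
    n = len(C)
    W = wasserstein_matrix(C, 0)
    for i in range(1, n):
        W = matmul(wasserstein_matrix(C, i), W)
    return W


# Hypothetical Dobrushin matrix (zero diagonal, row sums 0.5, 0.5, 0.6).
C = [[0.0, 0.3, 0.2],
     [0.1, 0.0, 0.4],
     [0.3, 0.3, 0.0]]

Ws = systematic_scan_matrix(C)
norm_C, norm_Ws = inf_norm(C), inf_norm(Ws)
print(f"||C|| = {norm_C:.3f}, ||W_s|| = {norm_Ws:.3f}")
```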

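We now apply estimate (19) to bound the random scan Markov chain. As an intermediate step,
define $S := \frac{1}{n} \sum_{i \in V} T^{[i]}$. We have

$$\delta_j(S h) \le \frac{1}{n} \sum_{i \in V} \delta_j(T^{[i]} h) \le \frac{1}{n} \sum_{i \in V} \sum_{k \in V} \delta_k(h) (W^{[i]})_{kj} = \sum_{i \in V} \delta_i(h) Z_{ij},$$

where $Z := \frac{1}{n} C + \frac{n-1}{n} I$, $I$ being the identity matrix. Iterating this result $n$ times, we
find $\delta_j(T_r h) \le \sum_{i \in V} \delta_i(h) (W_r)_{ij}$, where $W_r := Z^n = \left(\frac{1}{n} C + \frac{n-1}{n} I\right)^n$. By the same argument
presented above, we get

$$|\rho T_r^m h - \rho' T_r^m h| \le \|W_r\|_\infty^m \sum_{i \in V} \delta_i(h) \le \left(1 - \frac{1 - \gamma}{n}\right)^{nm} \sum_{i \in V} \delta_i(h).$$

The proof is concluded if we take $\rho'$ to be $\mu$, noticing that $\mu T_s = \mu T_r = \mu$. ■

Proof [Proof of Lemma 5] Each kernel leaves $\mu$ invariant as a consequence of the tower property
of conditional expectations as, for each function $h : \mathbb{S} \to \mathbb{R}$ we have

$$\mu T^{[i]} h = \mathbf{E}[T^{[i]} h(X)] = \mathbf{E}[\mathbf{E}[h(X) | X^{V \setminus \{i\}}]] = \mathbf{E}[h(X)] = \mu h.$$

For $i, j \in V$, $i \neq j$, $x \in \mathbb{S}$, we have

$$\left| \frac{1}{1 + e^{\beta \Delta_i f(S(x^{V \setminus \{i,j\}} 0^i 0^j))}} - \frac{1}{1 + e^{\beta \Delta_i f(S(x^{V \setminus \{i,j\}} 0^i 1^j))}} \right| = \frac{\left| e^{\beta \Delta_i f(S(x^{V \setminus \{i,j\}} 0^i 0^j))} - e^{\beta \Delta_i f(S(x^{V \setminus \{i,j\}} 0^i 1^j))} \right|}{\left(1 + e^{\beta \Delta_i f(S(x^{V \setminus \{i,j\}} 0^i 0^j))}\right) \left(1 + e^{\beta \Delta_i f(S(x^{V \setminus \{i,j\}} 0^i 1^j))}\right)}.$$

The proof is concluded taking the maximum over $x \in \mathbb{S}$ on both sides. ■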


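Proof [Proof of Lemma 6] The fact that each kernel leaves $\mu$ invariant follows as a consequence
of the so-called detailed balance equation, which holds for each $x \in \mathbb{S}$, $y^i \in \mathbb{S}_0$ and is immediately
verified (once noticed that a proposed move from $x$ to $x^{V \setminus \{i\}} y^i$ with $y^i \neq x^i$ is accepted with
probability $1 \wedge \mu(x^{V \setminus \{i\}} y^i) / \mu(x)$):

$$\mu(x) K^{[i]}(x, y^i) = \mu(x^{V \setminus \{i\}} y^i) K^{[i]}(x^{V \setminus \{i\}} y^i, x^i).$$

In fact, this equation yields, for each function $h : \mathbb{S} \to \mathbb{R}$,

$$\mu T^{[i]} h = \sum_{x \in \mathbb{S}} \sum_{y^i \in \mathbb{S}_0} \mu(x) K^{[i]}(x, y^i) h(x^{V \setminus \{i\}} y^i) = \sum_{x \in \mathbb{S}} \sum_{y^i \in \mathbb{S}_0} \mu(x^{V \setminus \{i\}} y^i) K^{[i]}(x^{V \setminus \{i\}} y^i, x^i) h(x^{V \setminus \{i\}} y^i),$$

and the right-hand side of this expression is equal to $\mu h$, as clearly $\sum_{x^i \in \mathbb{S}_0} K^{[i]}(x^{V \setminus \{i\}} y^i, x^i) = 1$.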

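For $i \in V$, $x \in \mathbb{S}$, we have

$$\left| K^{[i]}(x^{V \setminus \{i\}} 0^i, \{0\}) - K^{[i]}(x^{V \setminus \{i\}} 1^i, \{0\}) \right| = \left| 1 - 1 \wedge e^{\beta \Delta_i f(S(x^{V \setminus \{i\}} 0^i))} - 1 \wedge e^{-\beta \Delta_i f(S(x^{V \setminus \{i\}} 0^i))} \right| = e^{-\beta |\Delta_i f(S(x^{V \setminus \{i\}} 0^i))|}.$$

For $i, j \in V$, $i \neq j$, $x \in \mathbb{S}$, we have

$$\left| K^{[i]}(x^{V \setminus \{j\}} 0^j, \{0\}) - K^{[i]}(x^{V \setminus \{j\}} 1^j, \{0\}) \right| =
\begin{cases}
\left| 1 \wedge e^{\beta \Delta_i f(S(x^{V \setminus \{i,j\}} 0^i 0^j))} - 1 \wedge e^{\beta \Delta_i f(S(x^{V \setminus \{i,j\}} 0^i 1^j))} \right| & \text{if } x^i = 0, \\
\left| 1 \wedge e^{-\beta \Delta_i f(S(x^{V \setminus \{i,j\}} 0^i 0^j))} - 1 \wedge e^{-\beta \Delta_i f(S(x^{V \setminus \{i,j\}} 0^i 1^j))} \right| & \text{if } x^i = 1.
\end{cases}$$

Hence, the Dobrushin coefficients (7) read

$$C_{ij} =
\begin{cases}
\max_{S \subseteq V : S \not\ni i} e^{-\beta |\Delta_i f(S)|} & \text{if } i = j, \\
\max_{S \subseteq V : S \not\ni i,j} \left\{ \left| 1 \wedge e^{\beta \Delta_i f(S)} - 1 \wedge e^{\beta \Delta_i f(S \cup \{j\})} \right| \vee \left| 1 \wedge e^{-\beta \Delta_i f(S)} - 1 \wedge e^{-\beta \Delta_i f(S \cup \{j\})} \right| \right\} & \text{if } i \neq j.
\end{cases}$$

Below are the algorithms considered in Section 3.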


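Algorithm 3 Random scan Gibbs sampler
Sample $S \subseteq V$ from the distribution $\rho$;
for $k = 1, \ldots, nm$ do
    Sample $i \in V$ uniformly;
    Draw $C \in \{0, 1\}$ with $P(C = 0) = \frac{1}{1 + e^{\beta \Delta_i f(S \setminus \{i\})}}$;
    If $C = 0$ then set $S \leftarrow S \setminus \{i\}$, else set $S \leftarrow S \cup \{i\}$;
end
Output: $S \subseteq V$ that is distributed according to $\rho T_r^m$.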
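A direct transcription of the random scan Gibbs sampler might look as follows. This is a minimal sketch under stated assumptions rather than a reference implementation: the set function `f` is any Python callable on sets, the initial distribution $\rho$ is taken to be a point mass at the `start` set, and the function name, its parameters, and the toy function $\sqrt{|S|}$ are hypothetical.

```python
import math
import random


def gibbs_random_scan(f, V, beta, sweeps, rng, start=frozenset()):
    """Run sweeps * |V| single-site Gibbs updates targeting mu(S) ~ exp(beta * f(S)).

    `start` plays the role of a sample from the initial distribution rho
    (here simply a fixed set, which is an assumption of this sketch).
    """
    items = list(V)
    S = frozenset(start)
    for _ in range(sweeps * len(items)):
        i = rng.choice(items)                      # sample i uniformly from V
        without_i = S - {i}
        gain = f(without_i | {i}) - f(without_i)   # Delta_i f(S \ {i})
        # P(C = 0) = 1 / (1 + exp(beta * gain)); may overflow for huge beta * gain.
        p_remove = 1.0 / (1.0 + math.exp(beta * gain))
        if rng.random() < p_remove:
            S = without_i
        else:
            S = without_i | {i}
    return S


# Hypothetical usage with a simple concave-of-cardinality (hence submodular) function.
rng = random.Random(0)
sample = gibbs_random_scan(lambda S: math.sqrt(len(S)), range(5),
                           beta=0.5, sweeps=10, rng=rng)
print(sorted(sample))
```

Each update needs two evaluations of `f`; when marginal gains can be updated incrementally, as for the examples of Section 4, this cost can be reduced.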


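Algorithm 4 Random scan Metropolis-Hastings
Sample $S \subseteq V$ from the distribution $\rho$;
for $k = 1, \ldots, nm$ do
    Sample $i \in V$ uniformly;
    if $i \in S$ then
        draw $C \in \{0, 1\}$ with $P(C = 0) = 1 \wedge e^{-\beta \Delta_i f(S \setminus \{i\})}$; if $C = 0$, then set $S \leftarrow S \setminus \{i\}$;
    else
        draw $C \in \{0, 1\}$ with $P(C = 1) = 1 \wedge e^{\beta \Delta_i f(S)}$; if $C = 1$, then set $S \leftarrow S \cup \{i\}$;
    end
end
Output: $S \subseteq V$ that is distributed according to $\rho T_r^m$.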
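The Metropolis-Hastings variant differs from the Gibbs sampler only in the acceptance rule for the proposed flip. The sketch below makes the same assumptions as a plain transcription would: `f` is a Python callable on sets, $\rho$ is a point mass at `start`, and the function name, parameters, and toy set function are hypothetical.

```python
import math
import random


def metropolis_random_scan(f, V, beta, sweeps, rng, start=frozenset()):
    """Run sweeps * |V| single-site Metropolis-Hastings updates for mu(S) ~ exp(beta * f(S))."""
    items = list(V)
    S = frozenset(start)
    for _ in range(sweeps * len(items)):
        i = rng.choice(items)                      # sample i uniformly from V
        without_i = S - {i}
        gain = f(without_i | {i}) - f(without_i)   # Delta_i f(S \ {i})
        if i in S:
            # Propose removing i; accept with probability 1 ^ exp(-beta * gain).
            accept = 1.0 if gain <= 0 else math.exp(-beta * gain)
            if rng.random() < accept:
                S = without_i
        else:
            # Propose adding i; accept with probability 1 ^ exp(beta * gain).
            accept = 1.0 if gain >= 0 else math.exp(beta * gain)
            if rng.random() < accept:
                S = without_i | {i}
    return S


# Hypothetical usage with a simple concave-of-cardinality (hence submodular) function.
rng = random.Random(0)
sample = metropolis_random_scan(lambda S: math.sqrt(len(S)), range(5),
                                beta=0.5, sweeps=10, rng=rng)
print(sorted(sample))
```

Branching on the sign of `gain` before calling `math.exp` keeps the exponent non-positive, which avoids overflow for large values of $\beta |\Delta_i f|$.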


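Proof [Proof of Lemma 7] The first statement is trivial as $C_{ii} = 0$ for the Gibbs sampler (Lemma 5)
and $C_{ii} \ge 0$ for the Metropolis-Hastings sampler (Lemma 6) by definition. To
prove the second statement, we show that the following holds for each pair of real numbers $a, b$ with
$b \le a$:

$$\frac{1}{4} h(a,b) \le \frac{e^a - e^b}{(1 + e^a)(1 + e^b)} \le h(a,b),$$

where $h(a,b) := (1 \wedge e^a - 1 \wedge e^b) \vee (1 \wedge e^{-b} - 1 \wedge e^{-a})$. We distinguish three cases.

If $0 \le b \le a$, then $h(a,b) = e^{-b} - e^{-a}$, and we have

$$\frac{1}{4} h(a,b) = \frac{e^a - e^b}{4 e^a e^b} \le \frac{e^a - e^b}{(1 + e^a)(1 + e^b)} \le \frac{e^a - e^b}{e^a e^b} = e^{-b} - e^{-a} = h(a,b).$$

If $b \le 0 \le a$, then $h(a,b) = (1 - e^b) \vee (1 - e^{-a}) = 1 - e^{(-a) \wedge b}$. As

$$\frac{1 - e^{2((-a) \wedge b)}}{2} \le \frac{1 - e^{2((-a) \wedge b)}}{1 + e^{(-a) \wedge b - (-a) \vee b}} \le 1 - e^{(-a) \wedge b + (-a) \vee b} = 1 - e^{-a+b},$$

where for the second inequality we used that $\frac{1 - e^{2x}}{1 + e^{x-y}} \le 1 - e^{x+y}$ for $x \le y \le 0$, it follows that

$$\frac{1}{4} h(a,b) \le \frac{1}{2} \frac{1 - e^{(-a) \wedge b}}{1 + e^{(-a) \vee b}} = \frac{1}{2} \frac{1 - e^{2((-a) \wedge b)}}{(1 + e^{-a})(1 + e^{b})} \le \frac{1 - e^{-a+b}}{(1 + e^{-a})(1 + e^{b})} = \frac{e^a - e^b}{(1 + e^a)(1 + e^b)}.$$

Moreover,

$$\frac{e^a - e^b}{(1 + e^a)(1 + e^b)} = \frac{1 - e^{-a+b}}{(1 + e^{-a})(1 + e^{b})} \le \frac{1 - e^{2((-a) \wedge b)}}{(1 + e^{(-a) \wedge b})(1 + e^{(-a) \vee b})} = \frac{1 - e^{(-a) \wedge b}}{1 + e^{(-a) \vee b}} \le 1 - e^{(-a) \wedge b} = h(a,b).$$

Finally, if $b \le a \le 0$, then $h(a,b) = e^a - e^b$, and we have

$$\frac{1}{4} h(a,b) = \frac{e^a - e^b}{4} \le \frac{e^a - e^b}{(1 + e^a)(1 + e^b)} \le e^a - e^b = h(a,b).$$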
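The two-sided bound used in the proof can be spot-checked numerically on a grid of pairs $(a,b)$ with $b \le a$. The sketch below is only a sanity check, not a substitute for the case analysis; the grid, the tolerance `tol`, and the function names are hypothetical choices.

```python
import math


def h(a, b):
    """h(a, b) from the proof of Lemma 7 (intended for b <= a)."""
    up = min(1.0, math.exp(a)) - min(1.0, math.exp(b))
    down = min(1.0, math.exp(-b)) - min(1.0, math.exp(-a))
    return max(up, down)


def gibbs_term(a, b):
    """(e^a - e^b) / ((1 + e^a)(1 + e^b)), the Gibbs sampler quantity."""
    return (math.exp(a) - math.exp(b)) / ((1.0 + math.exp(a)) * (1.0 + math.exp(b)))


grid = [k / 4.0 for k in range(-20, 21)]   # a, b in [-5, 5] with step 0.25
tol = 1e-12                                # slack for floating-point rounding

# Collect every grid pair with b <= a that violates h/4 <= gibbs_term <= h.
violations = [(a, b) for a in grid for b in grid if b <= a
              and not (0.25 * h(a, b) - tol <= gibbs_term(a, b) <= h(a, b) + tol)]
print(f"checked {sum(1 for a in grid for b in grid if b <= a)} pairs, "
      f"{len(violations)} violations")
```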


Appendix B. Applications, proofs 

Below are the proofs of the results presented in Section 4. 


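Proof [Proof of Lemma 8] For each $i, j \in V$, define

$$R_{ij} :=
\begin{cases}
0 & \text{if } i = j, \\
\alpha(\beta) \max_{S \subseteq V : S \not\ni i,j} \left| 1 - e^{\beta \Delta_j \Delta_i f(S)} \right| & \text{if } i \neq j,
\end{cases}$$

with $\alpha(\beta) := \max_{i \in V} \max_{S \subseteq V : S \not\ni i} e^{-\beta \Delta_i f(S)}$. The Lemma follows immediately
from Theorem 1 and Remark 2, once we prove that $C \le R$, element-wise, where $C$ is the Dobrushin
matrix defined in Lemma 5. In fact, condition (8) yields

$$\|R\|_\infty = \max_{i \in V} \sum_{j \in V} R_{ij} = \alpha(\beta) \max_{i \in V} \sum_{j \in V \setminus \{i\}} \max_{S \subseteq V : S \not\ni i,j} \left| 1 - e^{\beta \Delta_j \Delta_i f(S)} \right| \le \gamma < 1,$$

which corresponds to (12). For each $i \in V$ we clearly have $C_{ii} = R_{ii} = 0$. Henceforth, fix $i, j \in V$,
$i \neq j$. As

$$\frac{\left| e^{\beta \Delta_i f(S)} - e^{\beta \Delta_i f(S \cup \{j\})} \right|}{\left(1 + e^{\beta \Delta_i f(S)}\right)\left(1 + e^{\beta \Delta_i f(S \cup \{j\})}\right)} \le \left| e^{-\beta \Delta_i f(S \cup \{j\})} - e^{-\beta \Delta_i f(S)} \right| = e^{-\beta \Delta_i f(S \cup \{j\})} \left| 1 - e^{\beta \Delta_j \Delta_i f(S)} \right|,$$

taking the maximum over $S \subseteq V$, $S \not\ni i,j$, on both sides immediately yields $C_{ij} \le R_{ij}$. The
proof of the Lemma is concluded once noticed that $f$ being submodular means that $\Delta_j \Delta_i f(S) \le 0$
for each $i, j \in V$, $i \neq j$, $S \subseteq V$, $S \not\ni i,j$, and it implies $\Delta_i f(V \setminus \{i\}) \le \Delta_i f(S)$ for each $i \in V$,
$S \subseteq V$, $S \not\ni i$, so that $\alpha(\beta) = \max_{i \in V} e^{-\beta \Delta_i f(V \setminus \{i\})}$. ■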
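The element-wise domination $C \le R$ can be verified by enumeration for a small ground set. The sketch below computes the Gibbs sampler Dobrushin matrix of Lemma 5 and the matrix $R$ of this proof by brute force; the coverage function, the value of `beta`, the 0-based labelling, and all function names are hypothetical and serve only as an illustration.

```python
import math
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def gain(f, S, i):
    """Marginal gain Delta_i f(S)."""
    return f(S | {i}) - f(S)


def gibbs_dobrushin(f, n, beta):
    """Dobrushin matrix of the Gibbs sampler (Lemma 5), by enumeration."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rest = [v for v in range(n) if v not in (i, j)]
            for S in subsets(rest):
                ea = math.exp(beta * gain(f, S, i))
                eb = math.exp(beta * gain(f, S | {j}, i))
                C[i][j] = max(C[i][j], abs(ea - eb) / ((1.0 + ea) * (1.0 + eb)))
    return C


def lemma8_matrix(f, n, beta):
    """Matrix R from the proof of Lemma 8, by enumeration."""
    alpha = max(math.exp(-beta * gain(f, S, i))
                for i in range(n)
                for S in subsets(v for v in range(n) if v != i))
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rest = [v for v in range(n) if v not in (i, j)]
            R[i][j] = alpha * max(
                abs(1.0 - math.exp(beta * (gain(f, S | {j}, i) - gain(f, S, i))))
                for S in subsets(rest))
    return R


# Hypothetical coverage function: item i covers the elements in covers[i].
covers = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]


def coverage(S):
    return float(len(set().union(*(covers[i] for i in S))))


n, beta = len(covers), 0.3
C = gibbs_dobrushin(coverage, n, beta)
R = lemma8_matrix(coverage, n, beta)
max_excess = max(C[i][j] - R[i][j] for i in range(n) for j in range(n))
row_norm_R = max(sum(row) for row in R)
print(f"max(C - R) = {max_excess:.3e}, ||R|| = {row_norm_R:.4f}")
```

If the printed norm of $R$ is below one, Theorem 1 applies to this toy instance with $\gamma$ equal to that norm.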


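Proof [Proof of Corollary 9] For each $i \in V$, $S \subseteq V$, $i \notin S$, we have $\Delta_i g(S) = \sum_{k=1}^{m} (L_{ki} -
\max_{\ell \in S} L_{k\ell}) \vee 0$ (with the convention $\max_{\ell \in \emptyset} L_{k\ell} := 0$), and, clearly, $\Delta_i f(S) = \Delta_i g(S) - \lambda$. The function $g$ is submodular as, for each
$i, j \in V$, $i \neq j$, $S \subseteq V$, $i, j \notin S$, we have

$$\Delta_i g(S \cup \{j\}) = \sum_{k=1}^{m} \Big(L_{ki} - \max_{\ell \in S \cup \{j\}} L_{k\ell}\Big) \vee 0 \le \sum_{k=1}^{m} \Big(L_{ki} - \max_{\ell \in S} L_{k\ell}\Big) \vee 0 = \Delta_i g(S),$$

so also the function $f$ is submodular, as $\Delta_j \Delta_i f(S) = \Delta_j \Delta_i g(S) \le 0$. Note that for each $i, j \in V$,
$i \neq j$, $S \subseteq V$, $i, j \notin S$, we can write

$$\begin{aligned}
\Delta_j \Delta_i f(S)
&= \sum_{k=1}^{m} \Big\{ \Big(L_{ki} - \max_{\ell \in S \cup \{j\}} L_{k\ell}\Big) \vee 0 - \Big(L_{ki} - \max_{\ell \in S} L_{k\ell}\Big) \vee 0 \Big\} \\
&= \sum_{k=1}^{m} \Big( L_{ki} \wedge \max_{\ell \in S} L_{k\ell} - L_{ki} \wedge \max_{\ell \in S \cup \{j\}} L_{k\ell} \Big) \\
&= \sum_{k=1}^{m} \Big( L_{ki} \wedge \max_{\ell \in S} L_{k\ell} - L_{ki} \wedge \Big(L_{kj} \vee \max_{\ell \in S} L_{k\ell}\Big) \Big) \\
&= \sum_{k=1}^{m} \Big\{ \Big(\Big(\max_{\ell \in S} L_{k\ell} - L_{ki}\Big) \wedge 0\Big) 1_{L_{ki} \le L_{kj}} + \Big(\Big(\max_{\ell \in S} L_{k\ell} - L_{kj}\Big) \wedge 0\Big) 1_{L_{ki} > L_{kj}} \Big\},
\end{aligned}$$

where we used, in order, the following two equalities holding for any real numbers $x, y, z$:

$$(x - y) \vee 0 = x - y \wedge x, \qquad x \wedge z - x \wedge (y \vee z) = ((z - x) \wedge 0) 1_{x \le y} + ((z - y) \wedge 0) 1_{x > y}.$$

From the expression above it is clear that for each $i, j \in V$, $i \neq j$, we have $\Delta_j \Delta_i f(S) \le
\Delta_j \Delta_i f(S')$ if $S \subseteq S' \subseteq V$, from which it follows that $\min_{S \subseteq V : S \not\ni i,j} \Delta_j \Delta_i f(S) = \Delta_j \Delta_i f(\emptyset) =
-\sum_{k=1}^{m} L_{ki} \wedge L_{kj}$. As $\min_{i \in V} \Delta_i f(V \setminus \{i\}) \ge -\lambda$, the left hand side of (13) is upper bounded by

$$e^{\beta\lambda} \max_{i \in V} \sum_{j \in V \setminus \{i\}} \left(1 - e^{-\beta \sum_{k=1}^{m} L_{ki} \wedge L_{kj}}\right).$$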

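Proof [Proof of Corollary 10] It is easy to check that for each $i, j \in V$, $i \neq j$, and $S \subseteq V$ so that
$i, j \notin S$ we have $\Delta_i f(S) = b \sum_{\ell \in V} L_{i\ell} - 2c \sum_{\ell \in S} L_{i\ell}$ and $\Delta_j \Delta_i f(S) = -2c L_{ij} \le 0$, from which
it follows that $f$ is submodular. As $\min_{i \in V} \Delta_i f(V \setminus \{i\}) = (b - 2c) \min_{i \in V} \sum_{\ell \in V \setminus \{i\}} L_{i\ell}$, the left
hand side of (13) is upper bounded by

$$e^{\beta (2c - b) \min_{i \in V} \sum_{\ell \in V \setminus \{i\}} L_{i\ell}} \max_{i \in V} \sum_{j \in V \setminus \{i\}} \left(1 - e^{-2c\beta L_{ij}}\right).$$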


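Proof [Proof of Corollary 11] For $i \notin S \subseteq V$ we have

$$\det L_{S \cup \{i\}} = (\det L_S)(L_{ii} - L_{i,S} L_S^{-1} L_{S,i}),$$

where $L_{i,S} := (L_{ij})_{j \in S} \in \mathbb{R}^{1 \times |S|}$ and $L_{S,i} := (L_{ji})_{j \in S} \in \mathbb{R}^{|S| \times 1}$. It follows that

$$f(S \cup \{i\}) = \log \det L_{S \cup \{i\}} = f(S) + \log(L_{ii} - L_{i,S} L_S^{-1} L_{S,i}),$$

and the marginal gain reads $\Delta_i f(S) = \log(L_{ii} - L_{i,S} L_S^{-1} L_{S,i}) = \log \mathrm{Var}(X^i | X^S)$. Analogously,
we find $\Delta_i f(S \cup \{j\}) = \log(L_{ii} - L_{i,S \cup \{j\}} L_{S \cup \{j\}}^{-1} L_{S \cup \{j\},i})$. Recall that

$$L_{S \cup \{j\}}^{-1} = \begin{pmatrix} B & C \\ C^T & \frac{1}{d} \end{pmatrix}$$

with $d := L_{jj} - L_{j,S} L_S^{-1} L_{S,j}$, $B := L_S^{-1} + \frac{1}{d} L_S^{-1} L_{S,j} L_{j,S} L_S^{-1}$, and $C := -\frac{1}{d} L_S^{-1} L_{S,j}$. We get

$$L_{i,S \cup \{j\}} L_{S \cup \{j\}}^{-1} L_{S \cup \{j\},i} = (L_{i,S}, L_{ij}) \begin{pmatrix} B & C \\ C^T & \frac{1}{d} \end{pmatrix} \begin{pmatrix} L_{S,i} \\ L_{ji} \end{pmatrix} = L_{i,S} B L_{S,i} + 2 L_{ji} L_{i,S} C + \frac{L_{ij}^2}{d},$$

where we used that $L_{ij} = L_{ji}$, and that $L_{ij} C^T L_{S,i} = L_{ij} (L_{S,i}^T C)^T = L_{ji} (L_{i,S} C)^T = L_{ji} L_{i,S} C$.
Consequently,

$$\begin{aligned}
L_{i,S \cup \{j\}} L_{S \cup \{j\}}^{-1} L_{S \cup \{j\},i}
&= L_{i,S} \Big( L_S^{-1} + \frac{1}{d} L_S^{-1} L_{S,j} L_{j,S} L_S^{-1} \Big) L_{S,i} + 2 L_{ji} L_{i,S} \Big( -\frac{1}{d} L_S^{-1} L_{S,j} \Big) + \frac{L_{ij}^2}{d} \\
&= L_{i,S} L_S^{-1} L_{S,i} + \frac{1}{d} \big( L_{ij} - L_{i,S} L_S^{-1} L_{S,j} \big)^2,
\end{aligned}$$

where we used that $L_{j,S} L_S^{-1} L_{S,i} = (L_{j,S} L_S^{-1} L_{S,i})^T = L_{S,i}^T (L_S^{-1})^T L_{j,S}^T = L_{i,S} (L_S)^{-1} L_{S,j} =
L_{i,S} L_S^{-1} L_{S,j}$. Therefore, we find

$$\Delta_j \Delta_i f(S) = \log\left( 1 - \frac{\mathrm{Cov}(X^i, X^j | X^S)^2}{\mathrm{Var}(X^i | X^S) \, \mathrm{Var}(X^j | X^S)} \right) = \log\left(1 - \rho(i,j|S)^2\right).$$

As $0 \le \rho(i,j|S)^2 \le 1$ for each $i, j \in V$, $S \subseteq V$, it follows that $\Delta_j \Delta_i f(S) \le 0$ and $f$ is submodular.
From (17) we have $\Delta_j \Delta_i f(S) = -2 I(X^i; X^j | X^S)$, and the left hand side of (13) is equal to

$$\max_{i \in V} \frac{1}{\mathrm{Var}(X^i | X^{V \setminus \{i\}})^{\beta}} \max_{i \in V} \sum_{j \in V \setminus \{i\}} \max_{S \subseteq V : S \not\ni i,j} \left(1 - e^{-2\beta I(X^i; X^j | X^S)}\right).$$

The case $\beta = 1$ follows immediately using (17).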


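Proof [Proof of Corollary 12] First note that since $d\phi_A(x)/dx \ge c$ for each $x \in [0, n]$ it follows that
the discrete differences are all bounded, i.e., $\phi_A(x + 1) - \phi_A(x) \ge c$ for each $x \in \{0, \ldots, n - 1\}$
and $A \in \mathcal{S}$. Similarly, since $d^2\phi_A(x)/dx^2 \ge c'$ for each $x \in [0, n]$ we have $\phi_A(x + 2) - 2\phi_A(x +
1) + \phi_A(x) \ge c'$ for each $x \in \{0, \ldots, n - 2\}$ and $A \in \mathcal{S}$. Now, for each $i \in V$, $S \subseteq V$, $i \notin S$ we
have

$$\Delta_i f(S) = \sum_{A \in \mathcal{S} : A \ni i} \left\{ \phi_A(|A \cap (S \cup \{i\})|) - \phi_A(|A \cap S|) \right\},$$

and for each $i, j \in V$, $i \neq j$, $S \subseteq V$, $i, j \notin S$ we have

$$\begin{aligned}
\Delta_j \Delta_i f(S)
&= \sum_{A \in \mathcal{S} : A \ni i,j} \left\{ \phi_A(|A \cap (S \cup \{i,j\})|) - \phi_A(|A \cap (S \cup \{j\})|) - \phi_A(|A \cap (S \cup \{i\})|) + \phi_A(|A \cap S|) \right\} \\
&= \sum_{A \in \mathcal{S} : A \ni i,j} \left\{ \phi_A(|A \cap S| + 2) - 2\phi_A(|A \cap S| + 1) + \phi_A(|A \cap S|) \right\} \le 0,
\end{aligned}$$

where the inequality comes from the concavity of each $\phi_A$, as $\phi_A(x + 2) - 2\phi_A(x + 1) + \phi_A(x) \le 0$
for $x \ge 0$. As $\Delta_j \Delta_i f(S) \le 0$, $f$ is submodular. In particular, note that if $i, j$ are such that there is
no $A \in \mathcal{S}$ that satisfies $A \ni i,j$, then the previous expression yields $\Delta_j \Delta_i f(S) = 0$. As for each
$i \in V$ we have $\Delta_i f(V \setminus \{i\}) = \sum_{A \in \mathcal{S} : A \ni i} \{\phi_A(|A|) - \phi_A(|A| - 1)\} \ge c \, |\{A \in \mathcal{S} : A \ni i\}|$, and
for each $i, j \in V$, $i \neq j$, $S \subseteq V$, $i, j \notin S$ we have $\Delta_j \Delta_i f(S) \ge c'$, then the left hand side of (13)
is upper bounded by

$$\begin{aligned}
&\left(1 - e^{\beta c'}\right) e^{-\beta c \min_{i \in V} |\{A \in \mathcal{S} : A \ni i\}|} \max_{i \in V} \left| \{ j \in V \setminus \{i\} : A \ni i,j \text{ for some } A \in \mathcal{S} \} \right| \\
&\qquad \le \left(1 - e^{\beta c'}\right) e^{-\beta c \min_{i \in V} |\{A \in \mathcal{S} : A \ni i\}|} \max_{i \in V} \Big| \bigcup_{A \in \mathcal{S} : A \ni i} A \Big|.
\end{aligned}$$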